Women's E-Commerce Clothing Reviews

About Dataset

Context


This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.

Content

This dataset includes 23486 rows and 10 columns. Each row corresponds to a customer review, and includes the variables:

What was done in this notebook?

Outline

1. Import Necessary Libraries

Read the DataFrame and explore its shape, the distribution of missing values, and its info summary:
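A minimal sketch of this exploration step, assuming pandas. The helper name `summarize_missing` and the tiny stand-in frame are illustrative and not taken from the notebook; the real frame would come from `pd.read_csv` on the dataset's CSV file.

```python
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Tiny stand-in frame (hypothetical values); the real data would be loaded
# with pd.read_csv(<path to the reviews CSV>).
sample = pd.DataFrame({
    "Age": [33, 34, 60, 50],
    "Title": [None, None, "Some major design flaws", "My favorite buy!"],
    "Rating": [4, 5, 3, 5],
})

print(sample.shape)   # (rows, columns)
sample.info()         # dtypes and non-null counts per column
print(summarize_missing(sample))
```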

Clean the data by dropping columns and creating a new column.
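The outline does not say which columns are dropped or created. As one plausible sketch only, assuming `Title` and `Review Text` are merged into a hypothetical `Text` column and rows left without any text are discarded:

```python
import pandas as pd


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Merge title and body into one text column (an assumed cleaning step)."""
    out = df.copy()
    # Combine the two text fields; missing pieces become empty strings.
    out["Text"] = (
        out["Title"].fillna("") + " " + out["Review Text"].fillna("")
    ).str.strip()
    # Drop the source columns now that "Text" replaces them.
    out = out.drop(columns=["Title", "Review Text"])
    # Discard rows that ended up with no text at all.
    return out[out["Text"] != ""].reset_index(drop=True)


# Hypothetical rows, not taken from the real dataset.
raw = pd.DataFrame({
    "Title": ["Great", None, None],
    "Review Text": ["Fits well", "Too small", None],
    "Rating": [5, 2, 3],
})
cleaned = clean_reviews(raw)
print(cleaned)
```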

2. Univariate Analysis

We have 8 columns: 5 numeric columns and 3 categorical columns.


This section has two main parts: analysis of the numeric columns and analysis of the categorical columns.

2.1. Explore Numeric Columns

Numeric Columns in the DataFrame:

Explore the numeric columns in the dataset by plotting distribution plots and bar plots.

2.1.1. Age Distribution

Customer ages are mostly distributed in the range of 30 to 49.

2.1.2. Rating & Recommended IND

1. Rating: ratings of 4 and 5 account for about 77% of all ratings.

2. Recommended IND: about 82% of reviews in the dataset recommend the product.
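Shares like these can be computed with normalized value counts. The sketch below assumes pandas and uses made-up values, so its numbers will not match the figures quoted above.

```python
import pandas as pd

# Hypothetical ratings and recommendation flags, not the real dataset.
ratings = pd.Series([5, 5, 4, 3, 5, 4, 2, 5, 1, 4], name="Rating")
recommended = pd.Series([1, 1, 1, 0, 1, 1, 0, 1, 0, 1], name="Recommended IND")

# Share of each rating value, as a percentage of all reviews.
share = ratings.value_counts(normalize=True).sort_index() * 100
high_share = share.loc[[4, 5]].sum()

# The mean of a 0/1 column is the fraction of ones.
recommended_share = recommended.mean() * 100

print(share)
print(f"Ratings of 4 or 5: {high_share:.0f}%")
print(f"Recommended: {recommended_share:.0f}%")
```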

2.1.3. Positive Feedback Count

Most reviews are rarely marked as positive by other customers.

Regardless of whether a review is positive or negative, comments with a word count of 500 or more are the most common.

2.2. Explore Categorical Columns

Categorical Columns in the dataset:

Explore the categorical columns by plotting a treemap, bar plots, and a word treemap.

2.2.1. Department Name & Class Name

These two columns have an inclusion relationship (each class belongs to a department), so we plot a treemap of Department and Class:

In the Department, Tops, Dresses, and Bottoms make up the majority of the products, while in the Class, Dresses, Knits, and Blouses dominate.
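The counts behind such a treemap can be sketched with a two-level group-by. This assumes pandas and uses a few made-up rows; the resulting table could then be handed to whichever treemap plotting function is in use.

```python
import pandas as pd

# Hypothetical rows using the dataset's column names.
products = pd.DataFrame({
    "Department Name": ["Tops", "Tops", "Tops", "Dresses", "Bottoms"],
    "Class Name": ["Knits", "Blouses", "Knits", "Dresses", "Pants"],
})

# One row per (department, class) pair with its review count.
counts = (
    products.groupby(["Department Name", "Class Name"])
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
)
print(counts)
```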

2.2.2. Text Frequency

Explore the text by plotting a bar plot and generating a word treemap.

From the plots above, the top words are "dress", "fit", and "love".
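A word-frequency count of this kind can be sketched with the standard library alone. The function name `top_words`, the short stop-word list, and the sample reviews are all illustrative assumptions.

```python
import re
from collections import Counter

# A small illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "and", "is", "it", "i", "this", "to", "in", "so"}


def top_words(texts, n=3):
    """Return the n most common non-stop-word tokens across all texts."""
    counter = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counter.update(t for t in tokens if t not in STOP_WORDS)
    return counter.most_common(n)


# Made-up reviews for demonstration.
reviews = [
    "I love this dress, the fit is perfect",
    "Love the dress and the fit",
    "This dress is so cute",
]
print(top_words(reviews))
```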

3. Bivariate Analysis

Analysis of the relationship between Department and both Age and Rating scores, and of the relationship between Rating scores and Recommended IND.

3.1. Department by Age

From the bar plot above, we can observe:

3.2. Department by Rate

From the bar plot above, we can observe:

From the bar plot above, we can observe:

Recommended comments largely contain positive terms such as 'great', 'fit', 'comfortable', 'beautiful', and 'nice'.

4. Text Preprocessing

4.1. Define Preprocessing Function
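The notebook's own preprocessing function is not shown here. A minimal sketch of what such a function might do, assuming lowercasing, removal of markup and punctuation, and whitespace normalization (the name `preprocess` is hypothetical):

```python
import re


def preprocess(text: str) -> str:
    """Lowercase, strip markup and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML-like tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


print(preprocess("Love it!!  <br/>Fits GREAT."))
```

Note that this sketch also splits contractions such as "don't" into two tokens, which may or may not be wanted depending on the tokenizer used afterwards.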

4.2. Tokenization, Sequencing and Padding
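To illustrate the idea behind this step, here is a simplified word-level sketch in plain Python. It is not the notebook's code: a BERT pipeline would normally use the model's own subword tokenizer, and the function names, the reserved ids, and the choice of right-padding are assumptions made for this example.

```python
def build_vocab(texts):
    """Map each word to an integer id; 0 is reserved for padding, 1 for unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_padded_sequences(texts, vocab, max_len):
    """Convert texts to fixed-length id lists, truncating or right-padding."""
    sequences = []
    for text in texts:
        ids = [vocab.get(word, vocab["<unk>"]) for word in text.split()]
        ids = ids[:max_len]                              # truncate long texts
        ids += [vocab["<pad>"]] * (max_len - len(ids))   # pad short texts
        sequences.append(ids)
    return sequences


# The vocabulary is built from the first two texts only, so "the" in the
# third text is mapped to the unknown id.
texts = ["love this dress", "great fit"]
vocab = build_vocab(texts)
padded = to_padded_sequences(texts + ["love the fit"], vocab, max_len=4)
print(padded)
```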

5. Text Classification (BERT Model)

5.1. Define Model

5.2. Train Model

5.3. Model Inference

5.3.1. Restore Model

5.3.2. Compare Each Score: Accuracy, Precision, Recall, F1, AUC
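For reference, the five scores can be computed from true labels and predicted probabilities as sketched below. This is a plain-Python illustration with made-up values, not the notebook's evaluation code; the function name and the 0.5 threshold are assumptions.

```python
def classification_scores(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, F1 and ROC AUC for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # AUC as the probability that a random positive example is scored
    # higher than a random negative one (ties count as half).
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else float("nan")

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "auc": auc}


# Made-up labels (1 = recommended) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.2, 0.7]
scores = classification_scores(y_true, y_prob)
print(scores)
```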